BAYESIAN NONPARAMETRIC PREDICTION
AND  STATISTICAL  INFERENCE 

Bruce M. Hill

September  7,  1989 


Abstract 

The problem of Bayesian nonparametric prediction and statistical
inference is formulated and discussed. A solution is proposed based upon
A_n and H_n as in Hill (1968). The meaning of parameters in the subjective
Bayesian theory of Bruno de Finetti is discussed in connection both
with A_n and with conventional parametric models. It is argued that the
usual  sharp  distinction  between  prediction  and  parametric  inference  is 
largely  illusory.  The  finite  version  of  de  Finetti’s  theorem  is  emphasized 
for  the  practice  of  statistics,  with  the  infinite  case  used  only  to  obtain 
approximations  and  insight. 


1  Introduction 

Bayesian nonparametric statistics consists of methods for statistical inference
and prediction based upon weak a priori knowledge as to the form of the underlying
population. In real world problems one typically does not have the type of
sharp a priori knowledge usually assumed about models. Indeed, it is well known
that  in  the  practice  of  statistics  the  most  difficult  and  important  phase  consists 
of  the  specification  of  such  models.  In  this  article  we  wish  to  discuss  the  case  in 
which  it  is  difficult,  or  impossible,  to  model  the  data  in  terms  of  conventional 
parametric models such as the Gaussian, exponential, or even exponential family,
at least without resorting to complex mixtures of such distributions. Our
inference will instead be based upon the nonparametric Bayesian approach
of Hill (1968, 1980a, 1988b, 1987b). A version of this approach was originally
suggested from a fiducial point of view by R. A. Fisher (1939, 1948). See also
Dempster (1963).

In his celebrated article La Prévision (1937), Bruno de Finetti proposed a
subjective Bayesian solution to the problem of scientific induction, as formulated,
for example, by the Scottish philosopher David Hume (1748). De Finetti
did so in terms of the concept of exchangeability, which is a special form of
dependence that he introduced and studied extensively (1937). Other key references
are Hewitt and Savage (1955), Savage (1972), Heath and Sudderth (1976),
and Diaconis and Freedman (1980, 1981). In this article I shall first give a somewhat
personal review of the history and substance of the connection between
induction and subjectivistic perceptions of symmetry, with particular attention
to A_n and H_n, which I developed for the case of vague or diffuse prior knowledge
as  to  the  shape  of  the  underlying  distribution  of  the  observables. 

The  problem  of  induction  is  the  problem  of  drawing  inference  about  the 
future  based  upon  the  past.  This  problem  has  long  plagued  philosophers  and 
others,  partly  because  there  is  no  way  to  prove  that  induction  works  (apart  from 
induction  itself),  and  also  because  in  the  real  world  it  can  be  extremely  difficult 
to  formulate  inferential  or  decision  procedures,  i.e.,  inductive  techniques,  that 
are  appropriate  in  a  given  situation.  The  problem  is  best  thought  of  in  terms 
of  the  probabilistic  prediction  of  potentially  observable  random  quantities  (not 
necessarily exchangeable), say X_1, X_2, .... Given the values of the first n
observations, X_1 = x_1, ..., X_n = x_n, what can we say about X_{n+1} or any other
future observations? In the Bayesian approach this is done in terms of the evaluation
of a probability distribution for the future observables, given the data
X_1 = x_1, ..., X_n = x_n. Conventional Bayesian methods, using a prior distribution
for  a  ‘parameter,’  such  as  the  parameter  of  a  Bernoulli  sequence,  yield  such  a 
predictive  posterior  distribution,  in  addition  to  the  more  customary  posterior 
distribution  for  the  parameter.  In  such  situations,  once  a  statistical  model  and 
prior  distribution  have  been  formulated  and  specified,  the  posterior  distribution 
of  the  future  observations,  given  the  past,  is  thereby  completely  determined. 
Such  a  scheme  may  be  called  inductive,  since  it  prescribes  a  (coherent)  mode  of 
inference  and  behavior  with  respect  to  the  future  observables,  given  any  set  of 
data. 
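
As a minimal illustration of such an inductive scheme, the following Python sketch treats a Bernoulli sequence with a Beta(a, b) prior for the parameter. The Beta family, the default a = b = 1, and the function name are assumptions chosen for illustration and are not taken from the text. Once the prior is specified, the predictive probability for the next observation is fixed by the data.

```python
from fractions import Fraction

def predictive_success_probability(data, a=1, b=1):
    """P(next observation = 1 | data) under a Beta(a, b) prior.

    The Beta prior is an illustrative assumption; data is a list of 0/1 values.
    """
    n = len(data)
    s = sum(data)
    return Fraction(a + s, a + b + n)

data = [1, 0, 1, 1, 0, 1, 1]
print(predictive_success_probability(data))        # uniform prior
print(predictive_success_probability(data, 2, 5))  # a different prior, a different prediction
```

The same data yield different predictions under different priors, which is the sense in which the scheme is determined only after both model and prior have been specified.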

This  scheme,  as  usually  interpreted,  requires  that  there  exist  ‘true’  known 
probabilities  that  represent  the  conditional  distribution  of  the  data,  given  the 
parameter, i.e., a conventional statistical model. However, at the deepest level,
where ‘true’ probabilities either do not exist, or even if in some as yet unknown
sense they do exist, they are at least unknown, the conventional model-based
Bayesian theory is incomplete, since it is difficult even to give operational meaning
to the assertion that a particular model is ‘true,’ much less to find such a
model. The problem that de Finetti clearly formulated and largely solved was
the problem of giving meaning to Bayesian inferential procedures without relying
upon the usual crutch of an assumed statistical model. Before the fundamental
work of de Finetti, the assumption of such a true model was simply an
unjustified act of faith. One could, of course, refer to some underlying physical
theory, or to the central limit theorem, or some previous analogous data, to
support belief in such a model. But deep down this remained at best a matter
of delicate subjective judgment, and it was not even clear how to express
what such subjective judgments concerned. For example, consider the use of
the normal or Gaussian distribution. Poincaré (1912, p. 171) states in connection
with this distribution, “Tout le monde y croit cependant, me disait un
jour M. Lippmann, car les expérimentateurs s’imaginent que c’est un théorème
de mathématiques, et les mathématiciens que c’est un fait expérimental,” or
“everybody  believes  in  the  law  of  errors,  the  experimenters  because  they  think 
it  is  a  mathematical  theorem,  and  the  mathematicians  because  they  think  it  is 
an  experimental  fact.”  In  the  real  world  it  is  justified,  in  fact,  by  neither.  In 
Hill  (1969)  it  is  shown  that  the  use  of  the  normal  distribution  can  instead  be 
based  simply  upon  a  subjective  judgment  of  spherical  symmetry  for  the  ‘actual’ 
errors  in  the  observations.  See  also  Borel  (1914,  p.  66,  90-93)  and  Borel  (1906). 

Hill  (1969,  p.  95)  gives  the  exact  density  for  the  marginal  distribution  of  n 
coordinates  based  upon  spherical  symmetry,  or  conditional  uniformity,  on  the 
N-dimensional sphere. By Scheffé’s lemma that convergence of densities to a
proper  density  implies  convergence  in  distribution,  it  immediately  follows  that 
each  fixed  r-dimensional  marginal  distribution  of  the  joint  distribution  of  the  n 
coordinates  converges  to  the  Gaussian,  even  as  n  goes  to  infinity  as  well  as  N. 

In  this  sense  spherical  symmetry  implies  approximate  normality.  It  should  be 
noted that my statement of the result, which is for the case of spherical symmetry
without a constraint on the average of all N coordinates, agrees with that of
Borel,  who  discovered  and  stated  the  result  for  the  case  of  one  coordinate,  and 
appears  to  have  understood  the  general  case.  When  there  is  also  a  constraint  on 
the  average  of  the  N  coordinates,  then  my  exponent  N  -  n  -  2  should  be  changed 
to  N  -  n  -  3. 
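
The convergence to the Gaussian can be checked numerically for a single coordinate (n = 1). The sketch below uses the standard density of one coordinate of a point distributed uniformly on the sphere of radius sqrt(N) in N dimensions, proportional to (1 - x^2/N)^((N - 3)/2). It is an assumption of the sketch that this normalization corresponds to the exponent N - n - 2 mentioned above, and the function names are illustrative.

```python
import math

def sphere_marginal_density(x, N):
    """Density at x of one coordinate of a point uniform on the sphere of
    radius sqrt(N) in N dimensions (standard result; requires N >= 3)."""
    if x * x >= N:
        return 0.0
    log_const = (math.lgamma(N / 2) - math.lgamma((N - 1) / 2)
                 - 0.5 * math.log(math.pi * N))
    return math.exp(log_const + 0.5 * (N - 3) * math.log1p(-x * x / N))

def standard_normal_density(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def max_gap(N):
    """Largest absolute difference between the two densities on a grid over [-4, 4]."""
    grid = [k / 10.0 for k in range(-40, 41)]
    return max(abs(sphere_marginal_density(x, N) - standard_normal_density(x))
               for x in grid)

for N in (5, 50, 500, 5000):
    print(N, max_gap(N))
```

The printed gaps shrink as N grows, which is the densities-converge step to which Scheffé’s lemma is then applied.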

In the theory that I proposed, spherical symmetry, or more generally, conditional
uniformity on surfaces, is itself only an approximation based upon the
available knowledge, and does not purport to be more than this, or to have any
other objective meaning. For example, in the case of errors of measurement,
one may view the usual orthogonal axes of a coordinate system as arbitrary,
and therefore introduce rotational symmetry. Ultimately, it is simply a matter
of judging that spherical symmetry represents a sufficiently good approximation
to one’s opinions in order to be useful for inference, prediction, and decision-making.
In my opinion there is no hope to demonstrate that such a judgment
is either ‘correct’ or ‘incorrect,’ other than empirically, for example, by seeing
how well it works predictively.

At an even more basic level, as de Finetti first realized, induction can be
based upon a direct subjective judgment of exchangeability for the sequence of
observables. Once this subjective judgment is made, it is a mathematical fact
that one will be acting (nearly) as though some statistical model were true.
Conditional upon the parameters of such a model, the data will be regarded as
independent and identically distributed, if the exchangeable sequence is infinite,
and approximately so if it is a sufficiently long finite sequence. See Diaconis and
Freedman (1980). (It should be noted that the conventional assumption of independent,
identically distributed observations, with an unknown distribution,
corresponds to the subjective Bayesian assumption of exchangeability.) This is
de Finetti’s theorem, and its significance is that it provides a subjective justification
for the use of statistical models and for conventional model-based Bayesian
inference, provided that care is taken in the interpretation of such techniques.
See Savage (1972) for a treatment of exchangeability from this point of view.
De Finetti stressed the fact that the judgment of exchangeability is itself only a
subjective judgment, and perhaps only an approximation to one’s actual opinions.
To the extent that one judges the sequence as exchangeable, then one is
led to conventional Bayesian inferential techniques. Often, in fact, approximate,
or even partial exchangeability, will suffice to justify such techniques. Furthermore,
as will be discussed in Section 3, I believe that it is necessary to integrate
the  conventional  Bayesian  theory  with  data-analytic  methods  for  the  selection 
of  models,  parameters,  and  hypotheses.  See  Hill  (1987a)  for  discussion  of  the 
deeper  structures  that  may  underlie  conventional  statistical  models,  Hill  (1985) 
for  Bayesian  selection  of  models,  and  Hill  (1988)  for  a  theory  of  Bayesian  data 
analysis. 

In a practical sense, based upon the subjective judgment of exchangeability,
de Finetti had completely solved the problem of inductive inference for the
case of Bernoulli data, or, more generally, for multinomial data with a known
finite number of categories. Combined with the beautiful result of W. E. Johnson
(1932), as discussed in Zabell (1982), there was little more to be said at
the foundational or even practical level for these cases, other than to elaborate
on the choice of prior distribution for the Bernoulli parameter p or for the
parameter θ of a multinomial distribution. Thus the precise measurement (or
stable estimation) argument of L. J. Savage (1961, Ch. 4; 1962, p. 20), or
as presented in DeGroot (1970, p. 199), deals with the case in which the prior
distribution  is  diffuse  relative  to  the  likelihood  function,  so  that  it  is  of  little 
consequence,  and  the  posterior  density  for  the  parameter  can  be  approximated 
by  the  likelihood  function.  On  the  other  hand,  H.  Jeffreys’s  theory  of  hypothesis 
testing  covers  the  most  important  situations  in  which  the  prior  is  not  diffuse. 
See Edwards, Lindman and Savage (1963), and Hill (1974a, 1982) for discussions.
The problem of so-called ‘uninformative’ priors has also been dealt with
very  effectively  by  a  number  of  people  for  the  case  of  multinomial  data.  See 
Good  (1965,  Ch.  4)  for  a  review  and  discussion.  Furthermore,  it  has  long  been 
recognized  that  many  real  world  problems  can  be  adequately  modelled  by  such 
finite partitions, Fisher (1959, p. 111), Savage (1961, p. 4.23), so when this
can be done effectively there is available a more or less complete system (apart
from details and various complications that arise in practice) for inductive inference
and decision-making, including the prediction of future observations. In
de Finetti’s own words (1937, p. 147): “It is thus that when the subjectivistic
point of view is adopted, the problem of induction receives an answer which is
naturally  subjective  but  in  itself  perfectly  logical,  while  on  the  other  hand,  when 
one  pretends  to  eliminate  the  subjective  factors  one  succeeds  only  in  hiding  them 
(that  is,  at  least,  in  my  opinion),  more  or  less  skillfully,  but  never  in  avoiding 
a gap in logic. It is true that in many cases - as for example on the hypothesis
of exchangeability - these subjective factors never have too pronounced an influence,
provided that the experience be rich enough; this circumstance is very
important,  for  it  explains  how  in  certain  conditions  more  or  less  close  agreement 
between  the  predictions  of  different  individuals  is  produced,  but  it  also  shows 
that  discordant  opinions  are  always  legitimate.  This  does  not  make  any  change 
in  the  purely  subjective  character  of  the  whole  theory  of  probability.”  See  also 
de  Finetti  (1974,  Ch.  11). 
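
De Finetti’s remark that the subjective factors lose influence when "the experience be rich enough" can be illustrated for Bernoulli data. In the sketch below two individuals hold different Beta priors (the particular priors, the 70 percent success rate, and the function names are assumptions for illustration only), and the gap between their predictive probabilities is computed for increasing sample sizes.

```python
from fractions import Fraction

def predictive(successes, n, a, b):
    """P(next = 1 | successes in n trials) under a Beta(a, b) prior."""
    return Fraction(a + successes, a + b + n)

def disagreement(successes, n, prior_1, prior_2):
    """Absolute difference between the two individuals' predictive probabilities."""
    return abs(predictive(successes, n, *prior_1) - predictive(successes, n, *prior_2))

# Two hypothetical individuals: a uniform prior and a prior favouring success.
for n in (10, 100, 1000, 10000):
    s = (7 * n) // 10  # suppose 70 percent of the observations are successes
    print(n, float(disagreement(s, n, (1, 1), (8, 2))))
```

The disagreement never vanishes for finite n, in keeping with the remark that discordant opinions remain legitimate, but it becomes small relative to any practical purpose.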

Thus  in  the  exchangeable  case  the  only  type  of  situation  that  had  not  been 
essentially resolved was that in which no finite partition model was appropriate,
or more generally, when the number of parameters requisite realistically
to  model  the  data  is  large  relative  to  the  number  of  observations.  One  can 
speak  of  this  either  in  terms  of  multinomial  data  with  an  infinite  number  of 
categories,  or  alternatively,  in  terms  of  an  unknown  and  possibly  quite  large 
or  even  infinite  number  of  categories.  Still  again,  to  suggest  the  general  type 
of  problem,  one  can  speak  of  Bayesian  nonparametric  statistics,  or  of  Bayesian 
inference  about  an  ‘unknown’  distribution  function.  Whatever  words  we  may 
use,  what  we  are  trying  to  describe  is  the  situation  in  which  no  conventional 
parametric  statistical  model  is  thought  to  be  appropriate  for  the  exchangeable 
sequence of observations. This situation seems often to arise in the practice of
statistics. Indeed, from my own point of view, which will be explained further
below, such situations in fact represent the great majority of statistical situations, with
Gaussian  and  other  conventional  parametric  models  being  appropriate  only  in 
very  limited  contexts. 

What  then  can  be  said  about  the  nonparametric  case  from  a  subjective 
Bayesian  point  of  view?  The  first  thing  to  observe  is  that  de  Finetti’s  theorem 
still  holds,  so  that  in  the  case  of  an  infinite  exchangeable  sequence  of  observables, 
one will be mixing over a dummy variable that represents the ‘unknown’ distribution,
say F, in the population. De Finetti (1937, Ch. 4) had already given
an  insightful  development  of  the  mathematics  of  this  situation  for  exchangeable 
random  quantities.  Diaconis  and  Freedman  (1980,  1981)  have  presented  easily 
accessible  proofs  for  even  more  general  cases.  Just  as  in  the  case  of  exchangeable 
events,  what  must  be  specified  in  order  to  implement  the  Bayesian  approach,  is 
the mixing function, or a priori distribution for F. In the case of exchangeable
events,  F  is  concentrated  at  only  two  known  values,  0  and  1,  while  in  the  present 
case  the  distribution  F  can  in  principle  be  any  distribution  function  on  the  real 
line.  Special  mixing  distributions  will  correspond  to  the  subset  of  F  appropriate 
for  a  finite  multinomial  situation,  in  which  the  categories  are  coded  numerically, 
or  for  sure  knowledge  that  F  is  Gaussian,  etc.  The  fully  nonparametric  case  is 
that  in  which  F  cannot  be  restricted  to  such  special  subsets  of  the  space  of  all 
distribution functions. In principle, what the Bayesian must do is to specify
a prior distribution, π, on the space of all distribution functions F, and then,
given the data, update this prior distribution to become a posterior distribution,
π*, in accord with Bayes’s theorem.

How then can a subjective Bayesian specify a prior distribution that expresses
a realistic degree of vagueness about F? Since we are dealing with possibly
infinitely many parameters, it is clear that the problem is formidable even
from the point of view of the mathematics involved, and of course even much
more so conceptually. Furthermore, after observing a sample from the population,
and obtaining π*, in order to obtain the posterior predictive distribution of
future observations, one would have to integrate the conditional distribution of
the future observations, given F, with respect to this posterior distribution of F.
Here the ‘unknown’ F plays the same role as the ‘unknown’ θ of a conventional
multinomial  model,  but  the  mathematics  is  again  enormously  more  complicated. 
In  addition,  a  basic  difficulty  arises  here  that  did  not  appear  in  the  case  of  finite 
multinomial  models.  It  is  no  longer  the  case  that  one  can  rely  on  some  form 
of  Savage’s  precise  measurement  argument.  Thus  the  distribution  function  F 
may  have  infinitely  many  parameters,  and  no  matter  how  large  a  finite  sample 
is  taken  from  the  population,  the  prior  distribution  for  F  may  still  play  a  crucial 
role.  Even  if,  more  realistically,  we  regard  F  as  having  a  large  finite  number 
of  parameters,  in  a  practical  sense  the  same  phenomenon  occurs,  since  realistic 
sample  sizes  will  be  small  relative  to  the  total  number  of  parameters.  See  Hill 
(1975b)  for  a  discussion  of  this  phenomenon.  Typically  there  is  no  such  thing 
as  global  robustness  (for  all  possible  distributions  of  F),  or  in  other  words,  the 
posterior  distribution  may  be  extremely  sensitive  to  the  prior  distribution  for 
F. 

The  problem  is  not,  however,  so  hopeless  of  solution  as  may  first  appear.  The 
first  modern  day  hint  or  suggestion  as  to  the  nature  of  a  possible  solution  occurs 
in  the  work  of  R.  A.  Fisher  (1939,  1948),  who  proposed  a  fiducial  interpretation 
for what I later called A_n, and who gives credit to ‘Student’ for the underlying
idea. 

Consider a conventional formulation of statistical inference, in which the
observations are conditionally independent with cumulative distribution function
F(x; φ), where φ is a conventional unknown parameter. Assume that the
distribution function is continuous in x for each φ. Let X_(i) denote the ascending
order statistics of the data, for i = 1, ..., n. Then let θ_i = F(X_(i); φ) -
F(X_(i-1); φ), for i = 1, ..., n + 1, where by definition X_(0) = -∞ and X_(n+1)
= +∞. Before the data are drawn, clearly the distribution of the θ_i is a uniform
distribution on the n-dimensional simplex, i.e., a special Dirichlet distribution
in which all the parameters are equal to unity. This is the fundamental frequentistic
intuition with regard to A_n, which Fisher presumably used to put
forth his proposed fiducial solution. Thus Fisher suggested (or implied) that
even when the random variables X_(i) are replaced by their observed values x_(i),
the uniform distribution for the θ_i would still be appropriate. It should be
mentioned  that  the  articles  by  Fisher  (1939,  1948)  only  briefly  and  cryptically 
discuss  the  nonparametric  fiducial  case  that  we  are  concerned  with.  The  first 
clear formal statement of something like A_n is by Dempster (1963), who stated
the  Fisherian  argument  precisely,  changed  the  name  from  fiducial  to  ‘direct’ 
probability,  and  applied  the  argument  for  prediction  of  future  observables  as 
well. Dempster also asserted that what I later called A_n “does not appear to
have  a  Bayesian  interpretation.” 
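
The pre-data frequentist fact described above can be examined by simulation. The sketch below assumes an exponential distribution for F (any continuous F would serve; the choice, the seed, and the function names are illustrative assumptions) and averages the spacings θ_i = F(X_(i)) - F(X_(i-1)) over repeated samples. It checks only that each spacing has mean 1/(n + 1), which is a consequence of the uniform Dirichlet distribution and not a proof of it.

```python
import math
import random

def coverage_spacings(sample, cdf):
    """theta_i = F(x_(i)) - F(x_(i-1)) for i = 1, ..., n + 1,
    with F(x_(0)) = 0 and F(x_(n+1)) = 1."""
    u = [0.0] + sorted(cdf(x) for x in sample) + [1.0]
    return [u[i] - u[i - 1] for i in range(1, len(u))]

def mean_spacings(n, trials, rate=1.0, seed=0):
    """Average each of the n + 1 spacings over repeated exponential samples."""
    rng = random.Random(seed)
    cdf = lambda x: 1.0 - math.exp(-rate * x)
    totals = [0.0] * (n + 1)
    for _ in range(trials):
        sample = [rng.expovariate(rate) for _ in range(n)]
        for i, theta in enumerate(coverage_spacings(sample, cdf)):
            totals[i] += theta
    return [t / trials for t in totals]

print(mean_spacings(n=4, trials=20000))  # each entry should be near 1/5
```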

In my 1968 article I showed that in fact A_n does have a Bayesian interpretation.
Before discussing this, however, let me note that Fisher’s proposed fiducial
distribution  is  an  example  of  a  posterior  predictive  distribution,  since  whatever 
the  rationale,  it  is  posterior  to  the  data,  and  does  partially  specify  a  probability 
distribution for the future data. This predictive distribution is not completely
specified, since what it does is to attach a probability of 1/(n + 1) to each of the
n + 1 open intervals formed by the consecutive order statistics of the given sample,
assuming that there are no ties, and goes no further. The fiducial argument that
Fisher gave for this evaluation depends upon one’s willingness to persist with the
pre-data evaluation of the distribution of, say, F(X_(i); φ) - F(X_(i-1); φ), after
the X_(i) are replaced by their observed numerical values. Such a fiducial argument,
although intriguing, was logically suspect. See Edwards (1972, p. 207) for
a totally devastating example against the logic of the fiducial argument. Also,
Lindley  (1958)  had  already  shown,  under  certain  special  conditions,  that  the 
fiducial  argument  leads  to  a  genuine  posterior  distribution  only  in  simple  cases 
reducible to that of a location parameter. However, Fisher, in a footnote (1959,
p. 51), asserted that “Probability statements derived by arguments of the fiducial
type have often been called statements of ‘fiducial probability’. This usage
is  a  convenient  one  so  long  as  it  is  recognized  that  the  concept  of  probability 
involved  is  entirely  identical  with  the  classical  probability  of  the  early  writers, 
such  as  Bayes.  It  is  only  the  mode  of  derivation  which  was  unknown  to  them.” 
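
The partially specified predictive distribution described above is simple to write down. The following sketch (function names, the Gaussian sampling distribution, and the seed are assumptions for illustration) lists the n + 1 open intervals with their probabilities of 1/(n + 1), and then simulates the corresponding pre-data frequency with which a new observation falls in each interval. The simulation illustrates only the frequentist property; whether the same evaluation is appropriate after the data are seen is precisely the question at issue.

```python
import random
from fractions import Fraction

def a_n_intervals(sample):
    """Open intervals between consecutive order statistics, each paired with
    predictive probability 1/(n + 1). Assumes the sample has no ties."""
    xs = sorted(sample)
    if len(set(xs)) != len(xs):
        raise ValueError("this sketch assumes no ties in the sample")
    edges = [float("-inf")] + xs + [float("inf")]
    p = Fraction(1, len(xs) + 1)
    return [((edges[i], edges[i + 1]), p) for i in range(len(edges) - 1)]

def interval_frequencies(n, trials, seed=0):
    """Frequency with which observation n + 1 lands in each of the n + 1
    intervals formed by the first n, for i.i.d. standard Gaussian draws."""
    rng = random.Random(seed)
    counts = [0] * (n + 1)
    for _ in range(trials):
        xs = [rng.gauss(0.0, 1.0) for _ in range(n)]
        new = rng.gauss(0.0, 1.0)
        counts[sum(x < new for x in xs)] += 1  # index of the interval containing new
    return [c / trials for c in counts]

for interval, prob in a_n_intervals([2.7, 0.4, 1.9]):
    print(interval, prob)
print(interval_frequencies(n=4, trials=20000))
```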

In short, the situation with regard to A_n was anything but clear, and at the
time it was not even known whether A_n was a coherent evaluation in the sense
of de Finetti. If it was, then presumably it could have been derived by means of
a prior distribution for F, Bayes’s theorem, and an integration with respect to the
posterior distribution of F, as discussed above. The problem that I addressed in
my 1968 article was that of giving such a derivation of A_n. The first step in my
formulation  was  to  consider  the  case  of  arbitrary  finite  populations,  in  which 
the  size  of  the  population  may  be  unknown,  rather  than  infinite  populations,  or 
in  other  words,  to  deal  with  finite  exchangeable  sequences. 

This step is not only more realistic, since in the real world we are not ordinarily
called upon to deal with more than finite populations or sequences of
observables,  but  it  also  greatly  simplifies  the  mathematics.  Indeed,  the  proofs 
of  de  Finetti’s  theorem  for  the  infinite  case  by  Heath  and  Sudderth  (1976)  and 
by  Diaconis  and  Freedman  (1980,  1981),  proceed  by  taking  limits  for  the  finite 
case.  Thus  for  the  case  of  events,  one  can  condition  on  the  sum  of  the  indicators 
after  N  observations,  and  note  that  because  of  exchangeability,  all  paths  to  a 
given  sum  are  equally  likely.  The  exchangeable  distribution  of  any  N  indicators 
(for  example,  for  red  and  white  balls  in  an  urn)  is  then  identical  with  one  that 
arises  from  sampling  without  replacement  from  a  ‘randomly  selected’  urn  with 
N balls, i.e., the draws are made from an urn with composition (R, W), R +
W  =  N,  with  a  probability  equal  to  the  original  subjective  probability  that  the 
sum of the N indicators is R. More generally, one can consider the empirical
distribution  function  that  arises  from  ‘sampling’  the  first  N  coordinates  from  an 
exchangeable  sequence.  Conditional  on  this  empirical  distribution  function,  the 
individual  coordinates  are  distributed  uniformly  over  the  collection  of  N-tuples 
having  this  empirical  distribution.  Because  sampling  without  replacement  is, 
for  large  N,  close  to  sampling  with  replacement  from  this  empirical  distribution, 
and because of the convergence of this empirical distribution to some limiting
distribution (with probability one, under exchangeability), one obtains de
Finetti’s theorem for the infinite sequence in this way. See Diaconis and Freedman
(1980, p. 749; 1981, p. 209) for details. These authors have emphasized the
importance  of  the  finite  case  even  for  the  underlying  mathematics,  and  as  I  shall 
argue  later,  it  is  also  the  appropriate  formulation  for  inferential  purposes.  The 
model for A_n in Hill (1968, p. 679) is actually equivalent to the specification of
a diffuse prior distribution for the empirical distribution of a finite population.
In the context of Heath and Sudderth, or of Diaconis and Freedman, it is equivalent
to specifying the subjective distribution for the sufficient statistic based
upon  N  trials  within  their  models.  Thus  instead  of  specifying  a  diffuse  prior 
distribution  on  the  space  of  all  distribution  functions  F,  what  I  have  done  is  to 
specify  such  a  prior  distribution  for  the  empirical  distribution  function  of  the 
entire  finite  population  from  which  a  simple  random  sample  has  been  drawn. 
Here the number of units in the finite population can be unknown, as well as
the  number  of  jump-points  of  the  empirical  distribution  of  the  population,  and 
the  points  at  which  the  jumps  occur  and  the  sizes  of  the  jumps.  The  details 
concerning  this  diffuse  prior  distribution  will  be  given  in  Section  2.  Here  what 
I  want  to  discuss  is  the  underlying  sampling  model. 
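
The finite urn representation for events described above can be written out directly. In the sketch below, weights[R] stands for the subjective probability that the N indicators sum to R; the particular weights and function names are hypothetical. The probability of any ordered sequence of draws is obtained by mixing over urn compositions, and it depends on the sequence only through its number of reds.

```python
from fractions import Fraction
from itertools import permutations

def urn_sequence_probability(seq, red, white):
    """Probability of an ordered 0/1 sequence (1 = red) when drawing without
    replacement from an urn with the given composition; len(seq) <= red + white."""
    prob = Fraction(1)
    r, w = red, white
    for draw in seq:
        available = r if draw == 1 else w
        if available == 0:
            return Fraction(0)
        prob *= Fraction(available, r + w)
        if draw == 1:
            r -= 1
        else:
            w -= 1
    return prob

def mixture_probability(seq, weights):
    """Mix over urns with R red balls out of N = len(weights) - 1, where
    weights[R] is the probability that the N indicators sum to R."""
    N = len(weights) - 1
    return sum(wt * urn_sequence_probability(seq, R, N - R)
               for R, wt in enumerate(weights))

weights = [Fraction(1, 10), Fraction(2, 10), Fraction(3, 10), Fraction(4, 10)]
for seq in sorted(set(permutations((1, 0, 1)))):
    print(seq, mixture_probability(seq, weights))
```

Every ordering of two reds and one white receives the same probability, which is the statement that all paths to a given sum are equally likely.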

The  statistical  model  for  this  problem  can  be  thought  of  in  terms  of  sampling 
with  or  without  replacement  from  a  finite  population  of  units.  Imagine  that  each 
unit  carries  an  attached  tag  or  label,  for  example  giving  the  color  of  the  unit, 
or  the  name  of  the  species  to  which  the  unit  belongs,  or  a  numerical  value  such 
as  the  mass  of  the  unit,  or  the  future  time  of  death  of  the  unit.  It  does  no  harm 
to  visualize  the  population  of  units,  with  their  attached  labels,  as  sitting  in  an 
urn.  A  simple  random  sample,  with  or  without  replacement,  is  then  drawn,  and 
we  observe  the  value  of  the  label  for  each  unit  in  the  sample.  It  is  assumed 
for  simplicity  here  that  the  label  or  numerical  value  is  observed  without  error, 
although  the  theory  can  easily  be  extended  to  deal  with  errors  of  measurement. 

The  population  of  labels  or  numerical  values  can  be  described  in  terms  of  the 
empirical  distribution  of  such  labels  or  values.  Indeed,  because  we  are  dealing 
with  only  finite  populations,  the  case  of  colors  or  names  can  be  viewed  as  a 
special  case  of  the  case  of  numerical  values.  Thus  we  can  imagine,  without  loss 
of  generality,  that  the  finite  collection  of  colors  or  names  in  the  entire  population 
have  been  encoded  numerically,  thus  yielding  numerical  values.  For  simplicity, 
we  shall  describe  the  situation  for  the  case  of  such  numerical  values.  When  we 
return to the case of ‘colors,’ as in the species sampling problem, we shall point
out  the  special  features  that  arise  in  this  case.  For  the  time  being,  visualize  the 
urn population as consisting of the numerical values attached to the units, such
as  their  masses.  This  population  can  then  be  described  in  terms  of  the  number  of 
units  in  the  population,  say  N,  and  the  empirical  distribution  of  the  values  in  the 
population.  Note,  for  example,  that  if  sampling  is  with  replacement  from  this 
population,  and  if  the  number  of  distinct  values  in  the  population  is  known  to 
be  say,  M,  then  this  model  is  a  special  case  of  a  conventional  multinomial  model 
with  exactly  M  non-empty  categories.  In  general,  of  course,  M  need  not  be 
known, except that M ≤ N, and sampling can be without replacement. In any
case,  the  number  of  units  in  the  population,  N,  and  the  empirical  distribution 
of  population  values,  completely  characterize  the  finite  population  of  values.  In 
fact  here  the  empirical  distribution  for  the  entire  finite  population  of  values  plays 
a  similar  role  to  that  of  the  ‘unknown’  probabilities  in  a  conventional  statistical 
model.  Following  the  spirit  of  de  Finetti  (1937),  I  regard  the  fundamental 
problem  of  induction  to  be  reducible  to  that  which  arises  in  sampling  without 
replacement  from  an  urn  consisting  of  units  that  are  labelled  with  numerical 
values. 
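
The role played by N and the empirical distribution can be made concrete with a small hypothetical population. In the sketch below the population values and the function names are invented for illustration. The empirical distribution supplies the category probabilities for sampling with replacement (the multinomial case with M non-empty categories), while sampling without replacement uses the counts remaining in the urn.

```python
from collections import Counter
from fractions import Fraction

def empirical_distribution(population):
    """Map each distinct value to its relative frequency in the population."""
    N = len(population)
    return {v: Fraction(c, N) for v, c in sorted(Counter(population).items())}

def prob_with_replacement(sample, population):
    """Probability of an ordered sample drawn with replacement."""
    dist = empirical_distribution(population)
    prob = Fraction(1)
    for x in sample:
        prob *= dist.get(x, Fraction(0))
    return prob

def prob_without_replacement(sample, population):
    """Probability of an ordered sample drawn without replacement."""
    remaining = Counter(population)
    left = len(population)
    prob = Fraction(1)
    for x in sample:
        if left == 0 or remaining[x] <= 0:
            return Fraction(0)
        prob *= Fraction(remaining[x], left)
        remaining[x] -= 1
        left -= 1
    return prob

masses = [2.5, 3.1, 2.5, 4.0, 3.1, 2.5]   # hypothetical population, N = 6, M = 3
print(empirical_distribution(masses))
print(prob_with_replacement([2.5, 2.5], masses))
print(prob_without_replacement([2.5, 2.5], masses))
```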

The  solution  that  I  proposed  for  this  problem,  which  consists  of  a  model 
for  a  generalized  version  of  An  in  which  ties  can  occur,  will  be  discussed  in 
Section 2. Historically, the sequence of events concerning An after Dempster
(1963)  was  as  follows.  I  proved  in  my  1968  article  that  An  cannot  hold  exactly 
for  countably  additive  proper  prior  distributions,  in  the  case  of  exchangeable 
sequences  in  which  ties  have  probability  0.  At  the  same  time  I  recommended  it 
as  an  approximation  for  a  variety  of  situations,  that  can  be  roughly  described 
as  situations  in  which  the  data  is  measured  on  a  “rubbery  scale,”  and  gave 
several  models  in  which  it  would  be  appropriate.  I  also  proved  in  Hill  (1968,  p. 
686) that An for all n implies that the posterior distribution of the θ_i, defined
earlier  is  the  uniform  Dirichlet  distribution  on  the  (n  1)  dimensional  simplex, 
thus  giving  support  to  Fisher’s  fiducial  argument.  Also,  Hill  (1967)  derived  the 
posterior  expectation  of  a  future  observation,  and  of  the  mean  of  the  population, 
using An. The next historically significant development regarding An was the
proof  by  Lane  and  Sudderth  (1978),  using  finite  additivity,  that  An  for  all  n 
is  coherent  in  the  sense  of  de  Finetti,  i.  e.,  it  is  impossible  to  be  made  a  sure 
loser,  and  the  further  result  by  the  same  authors  (1984)  that  it  is  predictively 
coherent. The robustness and invariance properties of An were investigated by
myself in Hill (1980a) with the general result that it is robust in the modern
Bayesian sense of Berger (1965), Hill (1980b). Then my doctoral student Peter
Lenk  (1984)  showed,  along  with  many  other  things,  that  An  can  arise  as  a  limit 
of  proper  priors,  using  a  log  gaussian  model  for  the  prior  distribution  of  the 
unknown  density  function  of  the  population.  Next,  Berliner  and  Hill  (1988) 
used  An  to  obtain  the  predictive  distribution  for  future  observations  in  the  case 
of  censored  data,  as  for  example  in  survival  analysis.  Finally,  in  Hill  (1987b)  I 
have  constructed  a  class  of  simple  parametric  models,  called  splitting  processes, 
such that An holds for all n. A modification of this construction also yields
Hn for all n. The Dirichlet process of Ferguson (1973) turns out to be a very
special type of splitting process. It is also shown how An arises from sampling of
complex mixtures of distributions, and the relationship with the one-way random
effects  analysis  of  variance  is  explained. 

In the next Section I will restate An, give my model for inference and prediction,
and suggest a new and compelling (for me) subjectivistic argument for An.

2  An  and  Hn 

In Hill (1968) a direct specification, denoted An, for the posterior predictive
distribution of future observations was proposed. An was meant to express
extremely vague subjective prior knowledge as to the form of the underlying
population distribution. For the case of n = 1 and 2, An follows from conventional
parametric models (Gaussian, for example) with a uniform prior distribution
on the location parameter, or on the location parameter and logarithm of the
scale parameter, respectively, Jeffreys (1961, p. 171), Hill (1968, p. 688). For
example, when n = 1, suppose that the parameter θ is the mean of a normal
population with known standard deviation of unity. Given an observation
X_1 = x_1 from this population, the posterior distribution of θ is N(x_1, 1). The
predictive distribution of the next observation, X_2, is then easily seen to be
N(x_1, 2). Hence the posterior probability that X_2 < x_1 is .5. Note that for
any prior distribution which is diffuse relative to the likelihood function, A1
will hold to a good approximation, since the posterior distribution of θ will still
be approximately N(x_1, 1). A similar analysis applies in the case n = 2. At
the time of Hill (1968), it was not known whether An could be obtained for
conventional parametric models when n ≥ 3. However, Hill (1987b) shows that
this is the case for both An and Hn.
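
The n = 1 Gaussian example can be checked numerically. The following is only a minimal simulation sketch, assuming a uniform prior on the mean so that the posterior is N(x_1, 1); the function name `simulate_next_observation`, the observed value 3.7, the number of draws, and the seed are all illustrative choices, not part of the original argument.

```python
import random
import statistics


def simulate_next_observation(x1, draws=200_000, seed=0):
    """Draw X_2 from the posterior predictive distribution.

    Assumes theta | x1 ~ N(x1, 1) (uniform prior on the mean) and
    X_2 | theta ~ N(theta, 1), as in the n = 1 example in the text.
    """
    rng = random.Random(seed)
    samples = []
    for _ in range(draws):
        theta = rng.gauss(x1, 1.0)             # posterior draw of the mean
        samples.append(rng.gauss(theta, 1.0))  # next observation given that mean
    return samples


x1 = 3.7  # hypothetical observed value
x2 = simulate_next_observation(x1)
below = sum(v < x1 for v in x2) / len(x2)

# The predictive mean should be near x1, the variance near 2,
# and the proportion of draws below x1 near one half.
print(statistics.fmean(x2), statistics.variance(x2), below)
```

The two-stage draw mirrors the argument in the text: uncertainty about θ and sampling variation each contribute unit variance, and the symmetry of the predictive distribution about x_1 is what yields A1.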

An for untied data, or Hn for the case of ties, are exactly appropriate for
data  measured  on  a  merely  ordinal  scale,  or  with  a  trivial  modification,  for 
data  that  consists  of  labels  (such  as  the  names  of  species,  as  in  the  species 
sampling  problem),  and  can  yield  an  extremely  good  approximation  for  data 
on  a  ratio  or  interval  scale,  such  as  the  weights  in  a  population  of  penguins, 
as  will  be  discussed  at  the  end  of  this  section.  The  cases  where  it  is  exactly 
appropriate  can  be  described  as  data  measured  on  a  “rubbery”  scale.  Just  as 
with  other  nonparametric  models,  it  is  hardly  necessary  for  the  assumptions 
to  hold  literally,  in  order  that  the  conclusions  be  appropriate  to  a  very  good 
approximation. 

The condition An is defined as follows. An asserts that conditional upon
X_1, ..., X_n, the next observation X_{n+1} is equally likely to fall in any of the
n + 1 open intervals between successive order statistics of the given sample
(Hill, 1968, p. 677). Note that in our definition of An we do not assume that the
sequence is necessarily exchangeable or that ties have probability 0. Thus, we
can also include cases where there is a positive probability that the next
observation ties one of the previous observations, and also partially exchangeable
situations that satisfy An. At the present time I wish to slightly modify this
notation, use Hn to denote the situation in which ties can occur, and reserve An
for the special case of Hn in which there are no ties (or ties have probability 0).
In this article I will also assume that the observations are exchangeable,
although this will not be included in the definition of An and Hn.
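
As a small illustration of the definition, the sketch below lists the n + 1 intervals and the probability An assigns to each, for untied data. The function names `an_predictive` and `bounds_on_cdf` are hypothetical, and the bounds computed by the second function are simply what follows from An saying nothing about how the mass is spread within an interval.

```python
from fractions import Fraction


def an_predictive(sample):
    """Return (lower, upper, probability) for each interval under An.

    Assumes untied data; the two outermost intervals are unbounded.
    """
    xs = sorted(sample)
    if len(set(xs)) != len(xs):
        raise ValueError("this sketch of An assumes untied data")
    n = len(xs)
    edges = [float("-inf")] + xs + [float("inf")]
    return [(edges[i], edges[i + 1], Fraction(1, n + 1)) for i in range(n + 1)]


def bounds_on_cdf(sample, t):
    """Lower and upper bounds on Pr(X_{n+1} <= t) implied by An alone."""
    intervals = an_predictive(sample)
    # Intervals lying entirely at or below t certainly count.
    lower = sum((p for lo, hi, p in intervals if hi <= t), Fraction(0))
    # Intervals that start below t might contribute all of their mass.
    upper = sum((p for lo, hi, p in intervals if lo < t), Fraction(0))
    return lower, upper


intervals = an_predictive([2.0, 1.0, 3.0])
print(intervals)
print(bounds_on_cdf([2.0, 1.0, 3.0], 2.5))
```

Exact fractions are used so that the equal weights 1/(n + 1) are visible without rounding.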

An  specifies  a  predictive  distribution  for  one  future  observation.  If  also  An+1 
holds,  then  by  conditioning  upon  which  interval  the  first  new  observation  falls 
in,  we  can  obtain  a  predictive  distribution  for  two  new  observations,  and  by 
extension for an arbitrary number of new observations. See Hill (1968, p. 684)
for such predictive schemes.[1] Furthermore, we can use this same idea to deal
with censored data, again by conditioning upon which intervals the censored
observations will fall in. Berliner and Hill (1988) carry through such an analysis
for the case of survival data, present upper and lower bounds for the survival
function, and simple algorithms with which to make the analysis. In the survival
problem, for example, we assess the predictive probability distribution for
the  time  of  death  of  new  patients  given  a  treatment,  using  as  data  the  death 
times,  and  the  intervals  in  which  censoring  occurred,  i.  e.,  the  partial  censoring 
information,  for  a  previous  group  of  patients  who  were  given  the  treatment, 
and  with  whom  the  new  patients  are  regarded  as  exchangeable.  Chang  (1989) 
provides  additional  computational  algorithms,  and  extends  the  results  to  the 
two  sample  case. 
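
The conditioning scheme for several new observations can be sketched as follows, for untied data. Each new observation splits the interval it falls in, and the next condition in the sequence (An+1, An+2, ...) is then applied to the enlarged set of intervals. The function names are hypothetical, and the brute-force enumeration is only meant for small n and k.

```python
from fractions import Fraction
from itertools import product
from math import comb


def sequence_probability(n, assignment):
    """Probability that successive new observations land in given intervals.

    `assignment` lists, for each new observation in turn, the index (0..n)
    of the ORIGINAL interval it falls in. At step t there are n + t + 1
    current intervals, each equally likely, and an original interval that
    already holds c new observations has been split into c + 1 of them.
    """
    counts = [0] * (n + 1)
    prob = Fraction(1)
    for t, idx in enumerate(assignment):
        prob *= Fraction(counts[idx] + 1, n + t + 1)
        counts[idx] += 1
    return prob


def count_vector_distribution(n, k):
    """Distribution of how many of k new observations fall in each interval."""
    dist = {}
    for assignment in product(range(n + 1), repeat=k):
        counts = tuple(assignment.count(i) for i in range(n + 1))
        dist[counts] = dist.get(counts, Fraction(0)) + sequence_probability(n, assignment)
    return dist


dist = count_vector_distribution(n=3, k=2)
for counts, p in sorted(dist.items()):
    print(counts, p)
```

In this small case every vector of interval counts receives the same probability, 1/C(n + k, k), which is what the sequential argument gives in general when ties have probability 0.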

In addition to de Finetti (1937, 1974), other key references on exchangeability
are Hewitt and Savage (1955), Savage (1972), Heath and Sudderth (1976),
and Diaconis and Freedman (1980, 1981). (The article by Heath and Sudderth
gives an extremely simple and yet rigorous proof of de Finetti’s theorem for the
case of events. The articles by Diaconis and Freedman do so for the general
case.) The definition of exchangeability that we shall use is that motivated by
the subjective Bayesian viewpoint, namely, in terms of a subjective judgment
that the order is irrelevant. (Mathematically, this is the same as all other
definitions of exchangeability but psychologically it is different, in that we do not
assume  the  sequence  is  ‘truly’  exchangeable,  but  merely  that  one  regards  it  as 
exchangeable,  perhaps  only  as  an  approximation  to  the  truth.) 

To be precise, let X_1, ..., X_{k-1} be k − 1 random variables that are (finitely)
exchangeable in the subjective Bayesian sense; that is, the joint distribution of
any r distinct variables is the same as that for any other such r variables, r =
1, ..., k − 1. An infinite sequence of such variables is said to be exchangeable if
the above condition is true for each k. Such models arise from the following
Bayesian formulation: Assume that, given some distribution, say F, X_1, ..., X_n
are independent and identically distributed according to F. For F unknown it is
natural for the Bayesian to model F itself as ‘random’ with some apriori
probability specification. This can be done either parametrically or
nonparametrically. In either case, ‘integrating out’ F leads to an exchangeable
unconditional joint distribution for the X’s. Conversely, de Finetti’s theorem
implies that if the exchangeable sequence is infinite, then there exists a
distribution on F, called the prior distribution of F, for which the joint
distribution of the observations obtained by ‘integrating out’ F is the original
exchangeable distribution. See Hewitt and Savage (1955), Heath and Sudderth
(1976), and Diaconis and Freedman (1980, 1981) for proofs. The 1980 article,
which emphasizes the finite exchangeable case, is particularly appropriate for my
purposes. Thus the authors show that the most general exchangeable sequences
arise by taking limits of the finite exchangeable sequences that arise in sampling
without replacement from urns.

[1] Note that the equation at the top of page 684 is only valid if i ≠ j. When
i = j, it is necessary to add another term which corresponds to the possibility
that the second new observation ties the first. A similar correction is necessary
in the formula for E(θ_i × θ_j).
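
The finite exchangeability that arises from an urn can be verified by brute force on a toy example. This is only a sketch under the assumption that every order of drawing the whole urn is equally likely; the urn contents and the name `joint_distribution` are hypothetical.

```python
from collections import Counter
from fractions import Fraction
from itertools import permutations


def joint_distribution(urn, positions):
    """Exact joint law of the draws at `positions`.

    The whole urn is drawn without replacement and every draw order is
    treated as equally likely (units are distinct even when values tie).
    """
    orders = list(permutations(urn))
    weight = Fraction(1, len(orders))
    dist = Counter()
    for order in orders:
        dist[tuple(order[p] for p in positions)] += weight
    return dict(dist)


urn = [1.2, 1.2, 3.4, 5.0]  # hypothetical population values, with one tie
first_two = joint_distribution(urn, (0, 1))
last_two_reversed = joint_distribution(urn, (3, 2))

# Exchangeability: the joint law does not depend on which draws are chosen.
print(first_two == last_two_reversed)
```

The comparison uses the first two draws against the last two taken in reverse order, so both the choice of positions and their order are varied.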

I will now present my model for An, or more precisely, for my generalization
of  An,  called  Hn ,  which  allows  for  ties,  and  of  which  An  is  a  special  case. 

We  assume  that  there  exists  a  finite  population  of  units,  with  each  unit 
having  an  attached  value  or  label.  For  example  the  value  might  be  the  mass 
of  the  unit,  or  the  label  might  be  the  name  of  the  species  to  which  the  unit 
belongs.  We  assume  that  the  set  of  values  is  simply  ordered,  or  at  least  can  be 
simply  ordered.  By  a  simple  ordering  we  mean  a  relationship,  say  <,  which  for 
any two elements x, y of the set of values, is such that either x < y or y < x,
and which is transitive. (See Jeffreys (1957, Chs. 5-6), Luce and Narens (1987),
and  Whitrow  (1980,  Sec.  4.7)  for  discussions  of  the  concept  of  measurement.) 

Thus  masses  would  be  on  a  ratio  scale,  and  are  certainly  simply  ordered; 
while  labels  can  be  simply  ordered  for  a  finite  population  simply  by  designating 
an ordering. (This can be done for infinite populations as well, using the
well-ordering theorem, but there is no need to go into such things here.) Suppose
there are N units in the population, and that the set of attached values or
labels is {Z_i, i = 1, ..., N}. We shall now refer only to values, with it being
understood that we include labels as a special case after the finite population
has been simply ordered. Some of the values Z_i may be equal to one another.
Suppose that in fact there are only M distinct values amongst the Z_i, and denote
these in ascending order of magnitude as X(1) < X(2) < ... < X(M), where of
course M ≤ N. Finally, suppose that the value X(i) occurs in L_i units, where
L_i ≥ 1, since by assumption the value X(i) does in fact occur, and
L_1 + ... + L_M = N.

The above model constitutes our description of the finite population of values
Z_i. Note that this determines the empirical distribution of values in the
finite population, i. e., the empirical distribution has jumps occurring at X(1),
..., X(M), and the jump that occurs at X(i) has height L_i/N. Of course in
general all of these quantities are unknown, i. e., N, M, the X(i), and the L_i.
From the subjective Bayesian point of view one must then specify a probability
distribution for all of these quantities. It should be noted that this point of view
corresponds exactly to the recent probabilistic treatments of exchangeability for
finite sequences, as in Diaconis and Freedman (1980), where the finite exchangeable
sequence of length N is the vector Y_1, ..., Y_N, which would be generated by
sampling without replacement all N elements of the finite population, so that
these Y_i are some permutation of the Z_i.
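
The description of the finite population can be made concrete with a short sketch. The list `z` below is a made-up population of attached values, and the function names are hypothetical; the code simply extracts N, M, the distinct values X(1) < ... < X(M), the multiplicities L_i, and the jump heights L_i/N of the empirical distribution.

```python
from collections import Counter
from fractions import Fraction


def describe_population(z):
    """Return (N, M, X, L) for a finite population of attached values z."""
    counts = Counter(z)
    x = sorted(counts)          # distinct values in ascending order
    l = [counts[v] for v in x]  # number of units carrying each distinct value
    return len(z), len(x), x, l


def empirical_jumps(z):
    """Jump of the population empirical distribution at each distinct value."""
    big_n, _, x, l = describe_population(z)
    return {value: Fraction(count, big_n) for value, count in zip(x, l)}


z = [4.1, 2.0, 4.1, 7.5, 2.0, 4.1]  # hypothetical population values
print(describe_population(z))
print(empirical_jumps(z))
```

In practice none of these quantities is observed directly, which is why the model places a prior distribution on them.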

The case of an infinite exchangeable sequence may be viewed as an idealization
of this scheme, and gives rise to de Finetti’s theorem. But the model
in terms of sampling from a finite population is simpler, avoids difficulties and
paradoxes of infinity, is more realistic, and in view of the results of Diaconis and
Freedman, loses no generality in any case. For example, in my model we require
only a prior distribution for the composition of the finite population, i. e., for
(M, N, X, L), rather than a prior distribution on F, the theoretical distribution
for an infinite exchangeable population. It is far simpler to specify such a prior
distribution on the finite number of parameters (at most 2N + 2) needed to
describe this finite population, than to do so on the infinite dimensional space
of distribution functions F. Furthermore, we shall argue that there is a natural
way to represent vagueness for the finite population, which would be much
more  difficult  to  achieve  for  an  infinite  population  (for  example,  one  would  have 
to  confront  some  basic  issues  concerning  the  difference  between  countable  and 
finite  additivity). 

Now  let  us  consider  the  data  that  we  shall  be  analyzing.  It  is  assumed  that  a 
simple  random  sample  is  drawn  without  replacement  from  the  finite  population 
that  we  have  described  above.  Let  the  sample  size  be  n.  The  data  will  consist 
of  the  numerical  values  attached  to  the  n  units  that  are  thus  selected  from  the 
finite population. Let x(1) < x(2) < ... < x(m) be the ascending order statistics
of the sample, with m distinct values, 1 ≤ m ≤ n, and with n_i sample units
having the value x(i). Thus n_i ≥ 1, and n_1 + ... + n_m = n. It is assumed here
that the values are measured without error, so that each x(i) is necessarily some
X(j) in the population, but of course we do not know with certainty which. By data
we mean the set of m distinct x(i) values, and the n_i. Thus the data determines
the empirical distribution of the sample, but is more informative because n and
the n_i are known as well. We now require only one further bit of notation.
Given the data, define J_i to be the rank, in the population, of the value
x(i) in the sample, for i = 1, ..., m. The vector J = (J_1, ..., J_m) then gives the
true ranks, in the population, corresponding to the sample values x(i). Thus
1 ≤ J_1 < J_2 < J_3 < ... < J_m ≤ M, because of the fact that the x(i) and X(j) are
strictly ordered. We now are ready for the basic equations of the Hill model for
Hn.

In the first equation, we condition on the true composition of the finite
population, by which we mean the unknown quantities X, L, M, and N. This
equation gives the probability for observing the data together with J = j, for
each possible vector of ranks j.

\[
\Pr\{\text{data},\ J = j \mid X, L, M, N\} = \binom{N}{n}^{-1} \times \prod_{i=1}^{m} \binom{L_{j_i}}{n_i}, \tag{1}
\]

if X(j_i) = x(i), i = 1, ..., m, and is otherwise 0.

Note  that  this  would  be  the  likelihood  function  for  the  population  quantities, 
except that we have included J = j together with the data, because this is the
key  to  making  an  effective  evaluation;  the  ordinary  likelihood  function  would 
involve  a  mixture  of  (1)  with  respect  to  j.  For  sampling  with  replacement,  it  is 
only necessary to replace the factor $\binom{L_{j_i}}{n_i}$ by $(L_{j_i})^{n_i}$, etc. We shall not further deal
with  the  case  of  sampling  with  replacement,  since  sampling  without  replacement 
is  the  more  common,  more  difficult  to  analyze,  and  more  important  form  of 
sampling. 
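
Equation (1) is a multivariate hypergeometric probability, and it can be checked against direct enumeration for a small population. The sketch below is illustrative only: the multiplicities `l = [2, 3, 1]`, the sample size, and the function names are assumptions made for the example, and the sample values are identified with their population ranks so that the matching condition in (1) holds by construction.

```python
from collections import Counter
from fractions import Fraction
from itertools import combinations
from math import comb, prod


def prob_data_and_ranks(l, n_counts, j):
    """Right-hand side of (1) when the sample values sit at ranks j.

    l        : multiplicities L_1..L_M of the distinct population values
    n_counts : multiplicities n_1..n_m of the distinct sample values
    j        : 1-based ranks j_1 < ... < j_m of the sample values
    """
    big_n, n = sum(l), sum(n_counts)
    ways = prod(comb(l[ji - 1], ni) for ji, ni in zip(j, n_counts))
    return Fraction(ways, comb(big_n, n))


def sample_distribution(l, n):
    """Brute force over all equally likely samples of n units."""
    units = [rank for rank, li in enumerate(l, start=1) for _ in range(li)]
    weight = Fraction(1, comb(len(units), n))
    dist = Counter()
    for idx in combinations(range(len(units)), n):
        tally = Counter(units[i] for i in idx)
        j = tuple(sorted(tally))
        dist[(j, tuple(tally[r] for r in j))] += weight
    return dict(dist)


l = [2, 3, 1]  # hypothetical population: N = 6 units, M = 3 distinct values
brute = sample_distribution(l, n=3)
for (j, n_counts), p in sorted(brute.items()):
    print(j, n_counts, p, prob_data_and_ranks(l, n_counts, j))
```

Each printed line shows the enumerated probability next to the value given by the formula, and the probabilities over all possible (ranks, counts) outcomes sum to one.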

The next step is to integrate out over the unknown X values in the population.
In general such an integration requires the assumption of countable
additivity, or conglomerability in the finitely additive theory, as in Hill and
Lane (1985). However, in the present case with only two values for the
probability in question, i. e., that given by (1), or else the value 0, it follows
without any additional assumptions, that

\[
\Pr\{\text{data},\ J = j \mid L = l, M, N\}
= \binom{N}{n}^{-1} \times \prod_{i=1}^{m} \binom{l_{j_i}}{n_i}
\times \Pr\{X_{(j_1)} = x_{(1)}, \ldots, X_{(j_m)} = x_{(m)} \mid L = l, M, N\}. \tag{2}
\]

We thus obtain the basic result,

\[
\begin{aligned}
\Pr\{J = j,\ L = l,\ \text{data} \mid M, N\}
&= \Pr\{\text{data},\ J = j \mid L = l, M, N\} \times \Pr\{L = l \mid M, N\} \\
&= \binom{N}{n}^{-1} \times \prod_{i=1}^{m} \binom{l_{j_i}}{n_i}
\times \Pr\{X_{(j_1)} = x_{(1)}, \ldots, X_{(j_m)} = x_{(m)} \mid L = l, M, N\}
\times \Pr\{L = l \mid M, N\}.
\end{aligned} \tag{3}
\]

Clearly all that must be specified in order to make further evaluations are
simply the three components of the prior distribution on the composition of the
population, namely

\[
\Pr\{L = l \mid M, N\}, \tag{4}
\]

\[
\Pr\{X_{(j_1)} = x_{(1)}, \ldots, X_{(j_m)} = x_{(m)} \mid L = l, M, N\}, \tag{5}
\]

and

\[
\Pr\{M \mid N\} \times \Pr\{N\}. \tag{6}
\]

Although  our  primary  interest  in  this  article  is  the  specification  of  (5)  in  such 
a  way  as  to  express  diffuse  or  vague  knowledge  about  the  underlying  population 
of  values,  we  note  that  our  formulation  is  sufficiently  general  so  as  to  include 
conventional  parametric  specifications  as  well.  For  example,  we  may  be  of  the 
opinion  that  the  population  distribution  is  approximately  normal,  in  which  case 
the distribution of X can be chosen so that the X(i) are order statistics of
a  sample  from  a  normal  population,  and  similarly  for  any  other  parametric 
distribution.  We  shall  not  pursue  this  idea  here,  however,  since  the  most  basic 
case  is  the  nonparametric  one. 

We shall specify (4) and (5) as follows:

\[
\Pr\{L = l \mid M, N\} = \binom{N-1}{M-1}^{-1}, \tag{7}
\]

while, for each possible x(1), ..., x(m),

\[
\Pr\{X_{(j_1)} = x_{(1)}, \ldots, X_{(j_m)} = x_{(m)} \mid L = l, M, N\} \tag{8}
\]

does not depend upon j.

Any  specification  of  (4)  and  (5)  is  equivalent  to  a  specification  of  the  prior 
distribution for the empirical distribution of the population, given M and N.
Obviously  this  can  be  done  in  infinitely  many  ways,  any  one  of  which  might  be 
appropriate  in  a  specific  real  world  situation.  But  it  is  of  value  to  single  out  those 
specifications  that  are  of  special  significance,  such  as  for  example  correspond  to  a 
diffuse  prior  distribution  (as  is  commonly  done  with  improper  prior  distributions 
on  conventional  parameters),  and  also  those  that  are  known  to  be  compatible 
with much real world data. The specification (7) that I originally chose was to
take $\Pr\{L = l \mid M, N\} = \binom{N-1}{M-1}^{-1}$, which is the Bose-Einstein distribution
for non-empty cells, as in Feller (1968, p. 40). Thus the results in Hill (1968)
are  based  upon  this  choice,  while  those  in  Hill  (1980a)  discuss  the  robustness 
of  this  choice  within  the  class  of  exchangeable  distributions  for  L.  My  doctoral 
student,  Wen-Chen  Chen,  in  his  Ph.  D.  dissertation  (1978)  and  Chen  (1980) 
generalized  this  choice  to  include  arbitrary  symmetrical  Dirichlet-multinomial 
distributions,  and  argued  that  for  some  data  it  is  desirable  to  choose  a  Dirichlet 
prior  other  than  the  Bose-Einstein,  which  of  course  is  a  Dirichlet-multinomial 
corresponding  to  a  uniform  Dirichlet  distribution.  See  also  Lewins  and  Joanes 
(1984)  and  Boender  and  Kan  (1987)  who  use  the  same  model.  My  primary 
motivation  for  the  Bose-Einstein  distribution  (which  I  still  regard  as  the  single 
most  appropriate  choice)  is  the  connection  with  Zipf’s  Law.  This  law  represents 
more real world data than any other known law, including the Gaussian. It is
shown in my articles Hill (1970, 1974a, 1975a, 1979, 1980a, 1981), and in Hill
and Woodroofe (1975) and Woodroofe and Hill (1975), that the Bose-Einstein
choice yields Zipf’s Law. This is why I singled it out as of special significance
within  the  class  of  exchangeable  prior  distributions  for  L.  See  also  Ijiri  and 
Simon  (1975)  for  discussion  of  the  Bose-Einstein  distribution.  Of  course  it  is 
mathematically  straightforward  to  replace  the  Bose-Einstein  distribution  by  any 
other  Dirichlet-multinomial  distribution,  and  sometimes  this  may  be  of  value  in 
modelling  the  data.  The  logic  underlying  my  model  would  only  at  best  suggest 
that  the  distribution  of  L  should  be  chosen  to  be  exchangeable,  and  even  this 
is  not  really  necessary.  See  also  Hill  (1987b)  for  the  relationship  between  my 
model  for  Zipf’s  Law  and  the  random  discrete  distributions  of  Kingman  (1975). 
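
The Bose-Einstein specification for non-empty cells gives equal probability to every vector of positive multiplicities (L_1, ..., L_M) that sums to N, and there are C(N − 1, M − 1) such vectors. The sketch below enumerates them for a toy case; the function names and the values N = 6, M = 3 are illustrative assumptions.

```python
from fractions import Fraction
from math import comb


def compositions(total, parts):
    """Yield every ordered way to write `total` as a sum of `parts` positive integers."""
    if parts == 1:
        if total >= 1:
            yield (total,)
        return
    # Leave at least 1 for each of the remaining parts.
    for first in range(1, total - parts + 2):
        for rest in compositions(total - first, parts - 1):
            yield (first,) + rest


def bose_einstein_prior(big_n, m):
    """Uniform prior over multiplicity vectors with every L_i >= 1 and sum N."""
    p = Fraction(1, comb(big_n - 1, m - 1))
    return {l: p for l in compositions(big_n, m)}


prior = bose_einstein_prior(6, 3)
for l, p in sorted(prior.items()):
    print(l, p)
```

Replacing the constant weight by any other exchangeable weighting of the same support would give the alternative Dirichlet-multinomial choices mentioned in the text.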

Next,  one  must  also  make  some  specification  for  the  prior  distribution  of 
M  and  N.  The  most  basic  case  for  inference  is  simply  where  N  is  known  to  be 
large,  and  M  has  a  uniform  distribution,  given  N.  This  was  the  case  considered 
in  Hill  (1968).  Hill  (1979)  then  considered  the  case  where  M  has  a  truncated 
negative  binomial  distribution,  of  which  the  uniform  is  a  special  case.  Although 
the  specification  of  the  distribution  for  M,  given  N,  is  of  lesser  importance  here 
than  the  specification  of  (4)  and  (5),  it  does  play  a  crucial  role  in  obtaining 
Zipf’s  Law,  as  in  the  cited  articles  by  myself  and  by  Chen. 

Even  more  important  than  the  choice  of  the  Bose-Einstein  distribution  for 
L  is  the  choice  of  (8).  Here  we  directly  confront  the  problem  of  formulating  a 
diffuse  prior  distribution  on  the  empirical  distribution  of  the  population.  Note 
that if M = N, so that all L_i = 1, then we obtain the case where ties have
probability 0, and must then only express vagueness of opinion about the jump-points
X(i) in the population. Thus the problem of expressing a diffuse distribution
for the jump-points is logically independent of that of expressing one for L. It
was shown by Lane and Sudderth (1978, Theorem 1), defining An for the case
where  ties  have  probability  0  and  where  the  sequence  is  exchangeable,  that  (8) 
is  equivalent  to  An. 
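
For a finite population without ties, the link between (8) and An can be checked by enumeration. This is a sketch under stated assumptions: M = N, so all binomial factors in (3) equal 1 and, under (8), every rank vector j_1 < ... < j_n is equally likely a posteriori; the next unit is then drawn uniformly from the N − n unseen units. The function name and the small values of N and n are hypothetical.

```python
from fractions import Fraction
from itertools import combinations


def gap_probabilities(big_n, n):
    """Predictive probability that the next unit falls in each of the n + 1 gaps.

    Gap g lies between the g-th and (g+1)-th sample order statistics, with the
    two end gaps extending below the smallest and above the largest sample value.
    """
    rank_vectors = list(combinations(range(1, big_n + 1), n))
    weight = Fraction(1, len(rank_vectors))  # uniform posterior on the ranks
    probs = [Fraction(0)] * (n + 1)
    for j in rank_vectors:
        edges = (0,) + j + (big_n + 1,)
        for g in range(n + 1):
            unseen_in_gap = edges[g + 1] - edges[g] - 1
            probs[g] += weight * Fraction(unseen_in_gap, big_n - n)
    return probs


print(gap_probabilities(7, 3))
print(gap_probabilities(6, 5))
```

In both toy cases each gap receives probability 1/(n + 1). The second call has n = N − 1, where a single unit is unseen and its rank is equally likely to be any of the N possibilities.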

Consider then the specification (8). What it says is that no matter what the
distinct values may be, they contain no information whatsoever about the
ranks J_i of these values in the population. Clearly this is not always appropriate.
For example, if one believed that the population was approximately Gaussian
in form, then one would favor some j vectors over others. Or if one knew
sufficiently much about the set of values in the population, then one might
know, for example, that x(m) was in fact the largest value in the population. Or
again, if the X(i) are necessarily integers, and if two are consecutive, then one
knows that the corresponding J_i are also consecutive. To understand the force
of the argument for (8) however, consider the following example.

Suppose  for  the  sake  of  argument  that  there  are  100,000  adult  male  emperor 
penguins,  and  that  their  weights  can  be  measured  sufficiently  precisely  so  that 
no  two  agree  exactly.  (This  is  assumed  only  to  make  the  essential  point  clear. 
My  model  Hn ,  with  ties,  can  deal  with  any  degree  of  rounding.)  Consider  your 
apriori  subjective  opinions  about  the  population  of  weights  of  these  penguins. 
Suppose now that I were to give you all but one of these weights as the data x,
i. e., 99999 positive numbers, no two of which are equal. The question I wish
you to think about concerns your opinions about the vector J, which specifies
the  ranks  of  these  99999  numbers  in  the  population  of  all  100000  numbers. 
Condition  (8)  here  would  require  that  you  be  aposteriori  indifferent  as  to  which 
ranks  these  observations  have  in  the  population  with  M  =  N  =  100,000.  Note 
that  this  is  meant  to  apply  no  matter  what  the  values  are,  provided  that 
only  possible  values  are  included,  so  that  negative  values  are  excluded,  as  well  as 
weights  that  are  known  to  be  impossibly  large  or  impossibly  small.  For  example, 
if  (8)  holds,  then  you  are  indifferent  as  to  which  of  the  100000  possible  values  is 
missing  in  the  data.  It  could  just  as  well  be  the  largest  as  the  smallest,  or  any 
other  member  of  the  population.  Thus  it  would  be  the  largest  that  is  missing  if 
it were the case that J consists of the ranks 1, ..., 99999 in the population, and
it would be the smallest that is missing if J consists of the ranks 2, ..., 100,000
in  the  population.  Are  you  so  indifferent? 

A  fairly  natural  first  reaction  is  to  say  that  you  might  or  might  not  be 
indifferent, depending upon what the numbers x(i) that I give you are. And
you might feel that for lots of such sets of 99999 numbers you might be, and for
others you might not be. But think again. Suppose, to take an extreme case
that might seem to speak against A_99999, that the x(i) that I give you are such
that there is an enormous gap between the largest, x(99999), and all the others.
In fact, suppose that x(99999) is an extremely large value, say 1000 pounds,
one that (although perhaps not impossible), seems highly improbable, while the
other  sample  weights  are  all  less  than  100  pounds.  You  do  not  appreciate  the 
full  force  of  An  until  you  realize  that  if  the  largest  weight  in  the  sample  were  in 
fact  1000  pounds,  then  there  might  well  be  another  penguin  that  weighs  even 
more  than  this!  Thus  the  naive  reaction,  which  would  be  that  no  penguin  weighs 
anything  like  1000  pounds,  is  immediately  dispelled  once  one  fully  appreciates 
the  fact  that  you  have  already  seen  one  such  (in  the  scenario  of  the  problem) 
and  may  therefore  well  see  another  one.  Still  another  example  of  this  type 
concerns  human  age.  One  might  well  regard  it  as  extraordinarily  improbable 
that  any  human  being  has  lived  to  the  age  500  years.  But  if  one  such  could  be 
demonstrated,  then  you  might  well  think  that  another  might  also,  and  even  find 
that your opinions were roughly in accord with An.[2]

What condition (8) is expressing is a completely pragmatic attitude towards
the population. Such an attitude is not only a subjectively Bayesian coherent
attitude but in the case at hand even seems quite compelling; and this is for
the case of weights on a ratio scale, which is the worst type of example for An,
as opposed to data on a merely ordinal scale, such as the Mohs scale for hardness
of rocks, where the hardness values are more or less meaningless. See, for
example, Whitrow (1980, p. 216). And yet I think, after reflection, you may
find it compelling even in the extreme example I have given. It would of course
be even more compelling if the largest weight in the data were say, 120 pounds,
rather than 1000 pounds. The general argument that I would give is that (8)
with m = M − 1, and N = M (so there are no ties), is a highly compelling
subjective evaluation, and this implies A_{M-1}. Note that there is no possibility
of a mathematical proof that (8) is ‘correct,’ just as there is never any way of
proving that one ordinary prior distribution is more appropriate than any other.
All prior distributions are possible, and each is to be given ‘equal rights,’ as de
Finetti says. But just as some prior distributions are sometimes regarded as
more appropriate than others, for example, a uniform prior distribution on the
parameter of a Bernoulli process is sometimes regarded as particularly appropriate,
so too I claim that (8) is quite compelling, and I personally regard it as the
most generally appropriate specification. My reasons are perhaps not entirely
unrelated to those of Bayes (1764), and the fiducial intuitions by ‘Student’ and
Fisher.

[2] The Encyclopedia Americana, 1981, referring to penguins, states "In size they
range from the gigantic emperor penguins, standing about 40 inches high, and
weighing up to 90 pounds, to the diminutive fairy penguin of the Australian region
that attains a length of just over a foot."

That An for large n should be highly compelling also agrees with certain
frequentistic ideas in conventional nonparametric statistics. Very few statisticians
use parametric models when dealing with large samples from some underlying
population F. The reason is that one is nearly certain that the true distribution
is not of any specific parametric form, for example, Gaussian, and that with a
sufficiently large sample the discrepancies will almost certainly appear and be
serious. This is part of the approach to hypothesis testing of J. Berkson (1938),
for example, who pointed out that with a sufficiently large sample you will
certainly reject most conventional null hypotheses. Thus for a sufficiently large
sample one might be nearly certain that the data will allow rejection of any
prespecified fixed dimensional parametric model, even using a subjective Bayesian
test of the hypothesis, for which it is more difficult to reject the null hypothesis.
On the other hand, if the sample from the very same population F were
sufficiently small, then one might well use the Gaussian or some other parametric
model.  Because  of  the  relationship  of  An  to  the  empirical  distribution  function, 
as  in  Berliner  and  Hill  (1988,  p.  773),  it  is  clear  that  the  same  considerations 
that  make  conventional  statisticians  prefer  the  empirical  distribution  function 
when  dealing  with  large  samples  should  also  apply  to  An. 
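
Berkson's point can be illustrated with a small calculation. The sketch below is only an illustration under assumed numbers: a null hypothesis p = 0.5 for a Bernoulli proportion, a hypothetical true proportion of 0.51, and the conventional two-sided critical value 1.96. The function names are invented for the example. It computes the z statistic that would result if the sample proportion equalled the true proportion exactly, and searches for the sample size at which even this slight discrepancy leads to rejection.

```python
import math


def z_statistic(p_true: float, p_null: float, n: int) -> float:
    """z statistic for H0: p = p_null when the sample proportion equals p_true."""
    standard_error = math.sqrt(p_null * (1.0 - p_null) / n)
    return (p_true - p_null) / standard_error


def smallest_rejecting_n(p_true: float, p_null: float, critical: float = 1.96) -> int:
    """Smallest n at which |z| exceeds the critical value (idealized, no sampling noise)."""
    if p_true == p_null:
        raise ValueError("no discrepancy, so the idealized statistic never rejects")
    n = 1
    while abs(z_statistic(p_true, p_null, n)) <= critical:
        n += 1
    return n


# Assumed values for illustration: null 0.5, hypothetical truth 0.51.
for n in (100, 10_000, 1_000_000):
    print(n, round(z_statistic(0.51, 0.5, n), 2))
print("rejection first occurs near n =", smallest_rejecting_n(0.51, 0.5))
```

With these assumed values the statistic grows like the square root of n, so rejection becomes unavoidable once n is of the order of ten thousand, even though a discrepancy of 0.01 would rarely matter in practice.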

Now we come to a rather strange and interesting fact. Suppose I have managed
to convince you of the appropriateness of $A_{M-1}$. But it is a mathematical
fact, proved in Hill (1968, p. 688), that $A_k$ implies $A_j$ for j < k. Thus if you
accept $A_{M-1}$ as appropriate exactly, then you are forced into $A_1$ as well. Of
course, both $A_1$ and $A_2$ correspond to conventional Bayesian and frequentistic
procedures, with a diffuse prior on location, or on location and scale parameters,
respectively, and they are certainly sometimes appropriate as an approximation.
But it is equally clear that they are not always appropriate. How
are we to explain this? My argument for $A_{M-1}$, which I regard as extremely
compelling when M is large, if accepted, then implies $A_1$ as well, which is not
always compelling. I believe that the explanation is as follows. In my proof
that $A_k$ implies $A_j$ for j < k there is a backwards induction. In carrying the
argument backwards, it is possible that slight discrepancies from $A_{M-1}$ may
build up, yielding a possibly much larger discrepancy for $A_j$. I should also
point out that even $A_{M-1}$ need not hold literally. For example, suppose that
you knew a great deal about the average weight in the population of penguins.
Then if I gave you all but one of these weights, you would have a good idea
about the missing weight. Indeed, you would know it exactly if you knew the
average weight exactly. Similarly, one might observe that a particular weight
that one knows occurs in the population is missing in the sample. Thus one
might have a discrepancy even from $A_{M-1}$, and this could build up even more
in reaching down to $A_1$. It is considerations such as these which point out the
importance of recognising, once and for all, that we are at best only dealing
with approximations. These approximations can nonetheless be very useful. It
is my opinion that the nonparametric formulation, as in $H_n$, although itself only
an approximation, is ordinarily the most important way to perform predictive
inference, with parametric representations, such as the Gaussian, being useful
primarily for inference and prediction when the sample size is small.

I remarked earlier that $A_n$ is exactly appropriate for merely ordinal data,
such as hardness of rocks, in the absence of ties. The argument is as follows.
Suppose one draws a simple random sample of n rocks from a population of N
rocks in which no two are of the same composition or of the same hardness.
(Here, as is usual, one rock is said to be harder than another if it scratches
the other rock.) Before the data is taken you are surely of the opinion that J
is equally likely to be any of the $\binom{N}{n}$ possible j vectors, since this is precisely
what sampling without replacement means. In the present case, however, even
after the sample is drawn you must still be of the same opinion, since no ‘data’
becomes available other than the relative orderings of the rocks in hardness,
i.e., there are no values. (Even if some arbitrary scale is used, such as the
Mohs scale, it means nothing, and its ‘values’ are totally uninformative, as
in Whitrow (1980, p. 216).) Thus in this situation one is forced to make the
evaluation (8). Furthermore, this provides a justification for the original fiducial
intuition, which also ignores the ‘values’ of the observations. Finally, if we now
consider the case of ties, as for example if we draw a simple random sample from
the rocks on some mountain, then it can easily be seen that $H_n$ rather than $A_n$
applies, using an appropriate exchangeable distribution for L.
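
The claim that relative orderings alone leave all positions equally likely can be checked by brute force for a small population. The following sketch assumes a population of N distinct hardness ranks and enumerates every ordered draw of n + 1 rocks without replacement; the function name and the choice N = 6, n = 3 are illustrative only. It tabulates how many of the first n rocks are softer than the (n + 1)th, that is, which of the n + 1 intervals determined by the sample the next rock falls into.

```python
from fractions import Fraction
from itertools import permutations


def next_interval_distribution(N: int, n: int) -> list:
    """Exact distribution of the interval into which the (n+1)th draw falls.

    The population is the set of ranks 1..N (no ties). Every ordered draw of
    n + 1 distinct ranks is equally likely under simple random sampling
    without replacement. Interval k (k = 0..n) means that exactly k of the
    first n draws are smaller than the (n+1)th draw.
    """
    counts = [0] * (n + 1)
    total = 0
    for draw in permutations(range(1, N + 1), n + 1):
        new = draw[-1]
        k = sum(1 for x in draw[:-1] if x < new)
        counts[k] += 1
        total += 1
    return [Fraction(c, total) for c in counts]


# Hypothetical small case: 6 rocks in the population, 3 already sampled.
print(next_interval_distribution(6, 3))
```

Each of the n + 1 intervals receives probability 1/(n + 1), which is the equal-probability statement that $A_n$ makes. The enumeration covers only the situation before any values are seen, which is the relevant one here since ordinal data supply no values.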

Finally, in the case of ‘colors’ or ‘species,’ the natural way to proceed is
as follows. Suppose that we go to a new region and, taking samples, find n
living creatures that we decide belong to m distinct species. We can number
these species in any way we like, for example, we can take species 1 to be
the first type caught, etc., with species m the last type caught in our sample.
Define the quantity $\gamma_i$, i = 1, ..., m, to be the proportion of the unsampled
population belonging to the same species as the ith sample species, and $\theta_i$ to
be the proportion of the unsampled population with value strictly between $x_{(i)}$
and $x_{(i-1)}$, just as in Hill (1968, p. 682). In this case the $\theta_i$ are necessarily 0 for
i < m - 1, since any creature belonging to a new species must then be given a
number larger than m. Then although neither $A_n$ nor $H_n$ is exactly appropriate,
it is shown in Hill (1980a) that the posterior distribution of the quantities $\gamma_i$,
i = 1, ..., m, is exactly as under $H_n$, and that the posterior distribution of M
and the posterior probability of catching a new species are as in Hill (1968, p.
681, p. 691; 1979).

3  On  the  meaning  of  parameters 

The role and meaning of parameters in the de Finetti theory is quite different
from that in conventional statistics. Consider a finite exchangeable sequence
of 0-1 valued observations, $X_i$, for i = 1, ..., N. In the de Finetti approach,
there need not be any pre-existing ‘true’ probability, p, for a success, i.e.,
for $X_i = 1$. However, according to de Finetti’s theorem, if the sequence were
infinite, then one would implicitly be acting as though there were such a p, and
the prior distribution for such a p would be simply the de Finetti measure $\pi$
for the sequence, i.e., it is as though the a priori distribution for p was $\pi$,
and one’s opinions about the observable $X_i$ were such that conditional on p, the
observations formed a Bernoulli sequence. If the sequence is only finite, but N is
sufficiently large, to a good approximation the same thing is true. In this case,
p is simply the average of the N random quantities, $p = \bar{X} = (\sum_{i=1}^{N} X_i)/N$.
Conditional upon this p, one no longer has exact independence of the $X_i$, but
some degree of dependence. The difference between the infinite case and the
finite case amounts to the difference between sampling with replacement versus
without replacement from an urn. See Heath and Sudderth (1976) and Diaconis
and Freedman (1980). Of course all real world sequences are necessarily finite,
but for moderate N the difference between the infinite and finite case is of little
importance, and one uses the infinite case as a convenient approximation to the
finite case. This is also the spirit in which I originally proposed $A_n$.
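
The urn comparison can be made numerical. The sketch below is an illustration under assumed values, with hypothetical function names: it takes an urn of N balls of which K are successes, looks at the number of successes in n draws, and measures the total variation distance between drawing without replacement (the finite exchangeable case, hypergeometric) and drawing with replacement (the Bernoulli case with p = K/N, binomial).

```python
from math import comb


def hypergeom_pmf(k: int, n: int, K: int, N: int) -> float:
    """P(k successes in n draws without replacement) with K successes among N."""
    return comb(K, k) * comb(N - K, n - k) / comb(N, n)


def binom_pmf(k: int, n: int, p: float) -> float:
    """P(k successes in n independent draws) with success probability p."""
    return comb(n, k) * p**k * (1.0 - p) ** (n - k)


def total_variation(n: int, K: int, N: int) -> float:
    """Total variation distance between the two laws of the success count."""
    p = K / N
    return 0.5 * sum(
        abs(hypergeom_pmf(k, n, K, N) - binom_pmf(k, n, p)) for k in range(n + 1)
    )


# Assumed example: n = 5 draws, half of the urn are successes, growing N.
for N in (20, 1_000, 100_000):
    print(N, total_variation(5, N // 2, N))
```

For a fixed number of draws the distance shrinks as N grows, which is the sense in which the infinite case serves as an approximation to the finite one.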

In this formulation note that before the sequence is actually determined,
for example, before the coin is flipped, there is no pre-existing p, and what p
actually represents is the random average $\bar{X}$ of the N observables, which is as yet
to be determined. The a priori distribution for p is merely the prior distribution
for $\bar{X}$, and this is in fact a useful way to elicit opinions about the conventional
Bernoulli parameter p. Although p is usually thought of as a quantity with an
objective existence even before the coin is tossed, this is not really the case.
Of course, one can imagine if one likes, that the coin has already been tossed
N times, so that the $X_i$ have already been determined, but that one has not
yet observed them. In this case there would be an existing quantity, which is
as yet unknown, and is simply $\bar{X}$ for the realized sequence $X_i$. Provided there
is no additional information, in the subjective Bayesian framework it is then
precisely as though the tosses had not yet been made. See Hill (1968, Sec. 3)
for further discussion. It is largely immaterial, for practical purposes, which
point of view one takes as to the ‘objective’ existence of p. (Note, however, that
even in the case where $\bar{X}$ has already been realized, the interpretation of this
quantity as the ‘true’ probability cannot be made without assumptions as to
the sampling mechanism.) In this framework the distinction between ‘inference’
and ‘prediction’ becomes blurred. On the one hand, if the sequence has not yet
been determined, one would view p as a random quantity which one might want
to predict, i.e., it is the future proportion of heads. On the other hand, if the
sequence has been determined, but is as yet unobserved, then p might be thought
of as a parameter in the conventional sense. The upshot of this discussion is
that the usual sharp distinction between prediction and parametric inference is
largely illusory.

The situation with regard to parameters in $A_n$ is more subtle. Given the
data, $X_i$, for i = 1, ..., n, I have defined the ‘parameters’ $\theta_i$ and $\gamma_i$ to be the
proportions of observations in the unsampled population, between and at the
order statistics of the data. Such parameters are defined in terms of the data,
and so are not the usual kinds of parameters. Nonetheless, they are unknown
quantities, and so in the de Finetti theory one can deal with them just as with
any other unknown or ‘random’ quantities. Because the sequence of observable
random quantities, $X_i$, is viewed as exchangeable, it follows from the general
form of the theorem of de Finetti, that one is acting as though one had a distribution
$\pi$ on the space of all possible distribution functions, F. In principle the
situation is as follows. Given the data, the prior distribution $\pi$ is updated, as
usual, to become a posterior distribution $\pi'$, and posterior predictive probabilities
for future observables can be obtained by taking expectations with respect
to $\pi'$. For example, if ties have probability 0, then

$$\Pr\{X_{n+1} \in I_i \mid \text{data}\} = E[\theta_i \mid \text{data}] = E[F(x_{(i)}) - F(x_{(i-1)}) \mid \text{data}],$$

where in this equation F is the empirical distribution function for the unsampled
population³, and where the expectation is taken with respect to the posterior
distribution of F. Thus despite the fact that the $\theta_i$ depend upon the data for their
definition, in principle their posterior expectation can be defined in terms of the
‘parameter’ F, just as in conventional parametric statistics. Here, practically
speaking, F is simply the empirical distribution for the entire finite population of
the $X_i$, for large N, and plays the same role as does $\bar{X}$ for a Bernoulli sequence.
The only aspect that is more subtle is the fact that because we are dealing with
the huge space of all possible empirical distributions, it is difficult analytically
to specify the prior and posterior distributions $\pi$ and $\pi'$, respectively. However,
the parametric model of Hill (1987b) makes it clear just what these distributions
are.
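
For a concrete reading of the predictive equation, recall that under $A_n$, as defined in Hill (1968), each of the n + 1 intervals between consecutive order statistics has predictive probability 1/(n + 1) when there are no ties. The sketch below uses hypothetical function names and made-up data; it lists those intervals with their predictive mass and adds the masses over adjacent intervals.

```python
from fractions import Fraction


def an_predictive(data):
    """Intervals between order statistics, each paired with its A_n mass 1/(n+1)."""
    xs = sorted(data)
    if len(set(xs)) != len(xs):
        raise ValueError("A_n assumes no ties; with ties H_n applies instead")
    n = len(xs)
    edges = [float("-inf")] + xs + [float("inf")]
    mass = Fraction(1, n + 1)
    return [((edges[i], edges[i + 1]), mass) for i in range(n + 1)]


def probability_between(data, i: int, j: int) -> Fraction:
    """A_n probability that the next value lies between the i-th and j-th
    order statistics (1-based, i < j): the sum of j - i interval masses."""
    n = len(data)
    if not 1 <= i < j <= n:
        raise ValueError("need 1 <= i < j <= n")
    return Fraction(j - i, n + 1)


# Made-up observations, purely for illustration.
weights = [3.1, 1.2, 7.5, 4.4]
for interval, mass in an_predictive(weights):
    print(interval, mass)
print(probability_between(weights, 1, 4))
```

The function works only with the positions of the order statistics, so the actual values enter solely as interval endpoints.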

The two cases we have considered here, namely the 0-1 case, and the fully
nonparametric case, are in fact the extreme cases with respect to complexity of
the underlying model. Much of statistical inference and prediction takes place in

³ This distinction is necessary when N is finite.

an intermediate case, namely of a conventional parametric model for real-valued
observations. However, such intermediate cases can be considered in much the
same way. Consider, for example, the case of an exponential model for data.
Let the parameter be taken to be, say, $\alpha$, the expectation of the exponential
distribution. Again imagine only a finite population of values, $X_i$, for i =
1, ..., N, and consider $\bar{X}$ for this population. Then one’s apriori distribution
for $\alpha$ is approximately one’s prior distribution for $\bar{X}$, and conditional upon $\bar{X}$,
the observations are approximately independently distributed according to an
exponential distribution with ‘parameter’ $\bar{X}$.

The  final  point  I  wish  to  discuss  concerns  the  role  of  Bayesian  data  analysis 
with respect to $A_n$. In Hill (1987b) two theorems are proved. The first gives a
simple parametric model, called a nested splitting process, that gives rise to $A_n$
exactly. The second shows that from a subjectivistic point of view, $A_m$ holds in
sampling  from  complex  mixtures  of  distributions,  where  here  m  represents  the 
number  of  groups  or  types  formed  from  the  n  observations  via  data  analysis. 
For  example,  in  sampling  from  the  population  of  cetaceans  (whales,  porpoises, 
dolphins)  the  sample  animals  may  be  classified  according  to  species  (or  other 
variables)  into  m  groups.  From  my  point  of  view,  such  classification  should 
be  done  by  a  form  of  data  analysis.  After  performing  such  classification,  the 
statistical  problem  can  be  reduced  to  one  concerning  the  random  effects  model 
in  the  analysis  of  variance.  See  Box  and  Tiao  (1973),  Hill  (1965,  1967,  1977, 
1980b),  and  Lindley  and  Smith  (1972)  for  Bayesian  analysis  of  such  models. 

In Hill (1988) a theory of Bayesian data analysis is put forth in which, because
of computational complexity, or because of thoughts that are triggered off
during the analysis of the data, a departure is made from the classical Bayesian
theory in which models and prior distributions are all specified before seeing
the data. I believe that this modification is essential in order to make the classical
Bayesian approach more realistic in applications. Any scientist worth his
salt would play with his data, analysing it in a variety of ways, and giving free
rein to his imagination and creativity. As argued in Hill (1985), classical non-Bayesian
theory breaks down completely in connection with such data analysis,
since all probabilities would have to be conditional on the exact procedures employed,
including their order, and even the thoughts that cross one’s mind. This
also poses a challenge for the Bayesian approach. However, one can view the
data-analytic procedures as occurring prior to the point at which the Bayesian
analysis, proper, begins; and it should be observed that the Bayesian theory has
always had an arbitrary element in it as to the time point at which one proceeds
to make a formal Bayesian analysis. I have suggested that often the appropriate
point  is  following  the  process  of  data  analysis.  After  one  has  reformulated  old 
models,  or  formulated  entirely  new  models,  by  means  of  data  analysis,  one  can 
proceed  with  the  classical  form  of  Bayesian  reasoning,  including  robustness  and 
sensitivity  analysis,  as  in  Berger  (1984),  Hill  (1980b).  See  also  Hacking  (1967) 
and  Smith  (1986). 

In the context of $A_n$ and $H_n$ this means that the ‘parameters’ $\theta_i$ and $\gamma_i$ are
actually the results of such Bayesian data analysis, as discussed in Hill (1987b).
This need not, however, change the basic interpretation of these quantities. As
shown in Hill (1988), a generalization of the restricted likelihood principle of
Hill (1987a) remains valid in the context of data analysis. Of course classical
non-Bayesian reasoning, for example conventional asymptotic theory, becomes
entirely irrelevant. However, in low dimensional problems one can still plot likelihood
functions, and these may turn out to be sharp relative to ‘apriori’ distributions
for the parameters introduced following the data analysis. The force
of such a Bayesian analysis of data must depend upon an agreement amongst
scientists that specific prior distributions and likelihood functions are pertinent
to the problem, and can be considered on their own merits, even after the data
has been observed. In high dimensional problems one must learn new techniques
for the analysis and display of likelihood functions, as in Hill (1975) with an example
concerning the tails of distributions. See Mosteller and Wallace (1964),
and Hill (1987b, 1988) for examples of Bayesian data analysis.

4 Concluding remarks

The initial intuition as regards $A_n$ seems to be due to ‘Student,’ or at least
Fisher (1939) implies that this is the case. Fisher then generalized the idea and
interpreted it in a fiducial spirit, which Dempster (1963) crystallized and called
‘direct probability.’ Note that for all three of these authors the justification for
$A_n$ seems to be purely intuitive. Thus none give anything vaguely representing
a ‘proof’ for $A_n$, or suggest a way in which its coherency or rationale can
be discussed, or even indicate in what circumstances it might or might not be
appropriate. While I believe that sound intuition (or inspiration) is what all
scientific progress ultimately comes from, it is nonetheless the case that a critical
attitude is necessary, and that one must ask when and why $A_n$ is sensible,
and whether there are any qualifications and pitfalls associated with it. For
example, one must ask immediately, when, if ever, should one be using conventional
parametric models as opposed to $A_n$. It is in helping to understand such
questions that I believe the subjective Bayesian approach plays a fundamental
role.

Consider, for example, the case of a normal model with known standard
deviation of 1 and unknown mean, $\theta$. The fiducial argument of Fisher suggests
that the pivotal quantity $(X - \theta)$ should continue to have the N(0,1) distribution
even after X is replaced by its observed value x, which is a number. (The
confidence argument of Neyman does not assume this, but provides no way
of telling whether there is anything peculiar about the particular x for which
the confidence is quoted. As is well known, there are many examples, such as
the Fieller-Creasy example, in which such a procedure is patently absurd, in
that the whole real line may have confidence 95 percent. More generally, such
confidence procedures do not provide a way to allow one to deal with data for
which the conventional confidence level is obviously inappropriate based upon
prior knowledge that is generally accepted. Thus the confidence argument, as
applied in practice by sensible statisticians, is instead a conditional argument,
i.e., it is conditional upon not getting data that is wildly contrary to prior
knowledge. As shown in Hill (1985), the ‘true’ confidence coefficients, when
adjusted to be conditional upon not getting such data, are necessarily both
unknown and unknowable.)

The Bayesian argument goes far beyond this. It first tells one that if one has
a prior distribution for $\theta$ which is sufficiently diffuse relative to the likelihood
function, then in fact Fisher’s fiducial conclusion is justified. (This fact, which
Harold Jeffreys had been telling Fisher for years, seems finally to have been
accepted by Fisher, as the previously quoted footnote of Fisher (1959, p. 51)
seems to indicate.) Next, the Bayesian argument tells you that there are many
situations in which instead the prior distribution may be sharp relative to the
likelihood function, in which case the appropriate conclusions are quite different;
and still again, there are important cases in which the prior and the likelihood
are of comparable magnitude. Thus one sees clearly in what situations the
fiducial argument is relevant, and what the nature of its limitations is. See
Hill (1974, p. 570) for a mathematical discussion of the behavior of a posterior
distribution for various kinds of extreme data.
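
The three regimes just described can be made concrete with the standard conjugate calculation. The sketch assumes, purely for illustration, a Gaussian prior for $\theta$ whose mean and standard deviation are chosen by hand; the function name and the numbers are not from the text. With a single observation x having standard deviation 1, the posterior is again Gaussian, with precision equal to the sum of the prior and likelihood precisions.

```python
import math


def normal_posterior(x: float, prior_mean: float, prior_sd: float, sigma: float = 1.0):
    """Posterior mean and sd of theta given one observation x ~ N(theta, sigma^2)
    and an assumed prior theta ~ N(prior_mean, prior_sd^2)."""
    prior_precision = 1.0 / prior_sd**2
    data_precision = 1.0 / sigma**2
    post_var = 1.0 / (prior_precision + data_precision)
    post_mean = post_var * (prior_precision * prior_mean + data_precision * x)
    return post_mean, math.sqrt(post_var)


x_observed = 2.0  # hypothetical observed value
for label, prior_sd in (("diffuse", 1e6), ("comparable", 1.0), ("sharp", 0.01)):
    mean, sd = normal_posterior(x_observed, prior_mean=0.0, prior_sd=prior_sd)
    print(f"{label:>10}: posterior mean {mean:.4f}, sd {sd:.4f}")
```

With the very diffuse prior the posterior is essentially N(x, 1), in agreement with the fiducial statement; with the sharp prior the observation barely moves the posterior away from the prior mean; and with prior and likelihood of comparable precision the posterior mean falls between the two.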

The situation with regard to $A_n$ is of the same general nature as that for
a normal mean, except it is much more complicated. Thus the primary basis
for $A_n$ or $H_n$ is (8), which really says that the observed values $x_i$ are totally
uninformative about J. Once again the initial intuition comes from a form of
Fisher’s fiducial argument. Even if one finds his argument compelling, however,
one would presumably want to put it into a broader context, including at
least sampling without replacement and the case of ties. Thus the generalization
from $A_n$ to $H_n$, and from sampling with replacement to sampling from a
finite population without replacement, is important, since the case of ties and
of sampling without replacement is both more fundamental and more realistic.
The Bayesian approach does not stop here, however, for the underlying assumptions,
especially (8), are themselves only approximations. They are extremely
valuable, because without such approximations there is nothing that one can
do in a rational and logical way. But as with all things, they too are only
approximations, and the trick is to learn when they are appropriate.

Finally, because $A_n$ is a de Finetti coherent procedure, one knows that there
are no operationally meaningful ways in which one can be made a loser by using
$A_n$ or $H_n$. Obviously, this property is a desirable one, but it is not sufficient
to justify use of these procedures. Thus in addition to the internal coherency
property,  one  wants  also  to  know  whether  the  procedure  is  ‘reasonable,’  that  is 
to  say,  whether  it  corresponds  to  prior  knowledge  that  is  generally  considered 
to  be  appropriate  for  the  situation  at  hand,  or  for  which  a  reasonable  case  can 
be  made.  In  my  opinion,  it  ordinarily  is,  but  with  a  few  qualifications. 

Let me conclude by observing that $A_n$ is supported by all of the major
approaches to statistical inference. It is Bayesian, fiducial, and even a confidence/tolerance
procedure. It is simple, coherent, and plausible. It can even be
argued, I believe, that when viewed in the context of Bayesian data analysis,
$A_n$, along with $H_n$, constitutes the best solution we now have to the problem of
induction as formulated by Hume.